multinomial distribution

2022-03-08 · 3 min read

    \[ \renewcommand{\Pr}[1]{\text{Pr}\left[ #1 \right]} \def\eps{\varepsilon} \def\E{\mathbb{E}} \]

    Example #

    You have a $k$-sided, weighted die. The probability of each face $i = 1, \ldots, k$ is $p_i$. You roll the die $n$ times. The vector of counts of each face rolled follows a Multinomial distribution.
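
    A minimal sketch of this setup, simulated with the rand crate's WeightedIndex (the face weights and $n$ below are made up for illustration):

    use rand::distributions::{Distribution, WeightedIndex};

    fn main() {
        // hypothetical face weights for a 3-sided weighted die
        let p = [0.5, 0.3, 0.2];
        let n = 1_000;

        let die = WeightedIndex::new(&p).unwrap();
        let mut rng = rand::thread_rng();

        // tally how many times each face comes up; the vector of counts
        // (x_1, ..., x_k) is a single draw from Multinomial(n, p)
        let mut counts = vec![0usize; p.len()];
        for _ in 0..n {
            counts[die.sample(&mut rng)] += 1;
        }

        println!("counts: {counts:?}");
    }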

    Overview #

    | Property | Value |
    | --- | --- |
    | Parameters | $n > 0$ number of trials; $p_1, \ldots, p_k$ event probabilities, where $\sum_{i} p_i = 1$ |
    | Support | $x_i \in \set{0, \ldots, n}$ for each $i \in \set{1, \ldots, k}$, with $\sum_i x_i = n$ |
    | PMF | $\Pr{X = (x_1, \ldots, x_k)} = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ |
    | Mean | $\E[X_i] = n \cdot p_i$ |
    | Variance | $\text{Var}(X_i) = n p_i (1 - p_i), \; \text{Cov}(X_i, X_j) = -n p_i p_j \; (i \ne j)$ |
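
    As a sanity check on the PMF, a minimal sketch that evaluates it directly in log-space (plain std, no crates; the counts below are made up):

    /// ln(x!) computed by direct summation; fine for small x.
    fn ln_factorial(x: u64) -> f64 {
        (1..=x).map(|i| (i as f64).ln()).sum()
    }

    /// ln Pr[X = x] for X ~ Multinomial(n, p), where n = sum(x).
    fn ln_multinomial_pmf(x: &[u64], p: &[f64]) -> f64 {
        let n: u64 = x.iter().sum();
        let ln_coeff = ln_factorial(n) - x.iter().map(|&xi| ln_factorial(xi)).sum::<f64>();
        let ln_prob: f64 = x
            .iter()
            .zip(p)
            .map(|(&xi, &pi)| if xi == 0 { 0.0 } else { xi as f64 * pi.ln() })
            .sum();
        ln_coeff + ln_prob
    }

    fn main() {
        // e.g. 10 rolls of the weighted die from the example above
        let x = [5u64, 3, 2];
        let p = [0.5, 0.3, 0.2];
        println!("Pr[X = {x:?}] = {}", ln_multinomial_pmf(&x, &p).exp());
    }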

    Statistical Inference #

    Equivalence Tests #

    Ostrovski 2017 - Testing equivalence of multinomial distributions

    Wikiwand - Multinomial Test

    Resin 2021 - A Simple Algorithm for Exact Multinomial Tests

    Goodness-of-fit Tests #

    Ideally, we could use an exact test: check whether the observed counts $X$ fall inside some failure region (chosen according to some criterion) at a significance level $p_{significance}$. The criterion claims that $X$ does not come from our hypothesized distribution if $X$ lies in the failure region. We size the failure region from our chosen significance level: its total probability mass under the hypothesized distribution equals $p_{significance}$, so a larger $p_{significance}$ means a larger failure region.

    \[ p_{significance} = \sum_{X \;\in\; \text{region}} \text{multinomial-pmf}(X) \]

    Unfortunately, the multinomial PMF has an absurdly huge support for any non-trivial $n$ and $k$, and computing the CDF (or the mass of any failure region) means exhaustively summing joint probabilities over that support. Any non-trivial region is just too big, so this approach is computationally infeasible. :'(
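
    To get a feel for the blowup: the support contains $\binom{n + k - 1}{k - 1}$ points (the compositions of $n$ into $k$ non-negative counts). A quick sketch, with made-up $n$ and $k$:

    /// Number of points in the support of a multinomial with n trials and k
    /// categories: C(n + k - 1, k - 1), the number of ways to split n into k
    /// non-negative counts. Computed in f64 since it overflows u64 quickly.
    fn support_size(n: u64, k: u64) -> f64 {
        (1..k).map(|i| (n + i) as f64 / i as f64).product()
    }

    fn main() {
        // e.g. n = 1000 samples over k = 10 categories
        println!("{:e}", support_size(1000, 10)); // ~2.9e21 outcomes to enumerate
    }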

    Instead, we derive new random variables from the observations, called test statistics, that are more mathematically convenient or more computationally tractable, and compare their observed values against failure regions projected into the test statistic's distribution.

    G-test - multinomial goodness-of-fit

    Wikiwand - Pearson's chi-squared test

    Wikiwand - G-test / log-likelihood ratio test
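
    As a concrete example of such a test statistic, Pearson's $X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$ compares observed counts $O_i$ against expected counts $E_i = n p_i$ and is asymptotically $\chi^2$-distributed with $k - 1$ degrees of freedom under $H_0$. A minimal sketch:

    /// Pearson's chi-squared statistic for observed counts `x` against
    /// hypothesized multinomial probabilities `p`, where n = sum(x).
    fn pearson_chi_squared(x: &[u64], p: &[f64]) -> f64 {
        let n: u64 = x.iter().sum();
        let mut stat = 0.0;
        for (&xi, &pi) in x.iter().zip(p) {
            // skip zero-probability categories to avoid dividing by zero
            // (they contribute nothing when the observed count is zero, and a
            // non-zero observed count there means the hypothesized p is impossible)
            if pi > 0.0 {
                let expected = n as f64 * pi;
                let delta = xi as f64 - expected;
                stat += delta * delta / expected;
            }
        }
        stat
    }

    Comparing the observed statistic against the $\chi^2_{k-1}$ CDF yields a p-value, exactly as the G-test implementation below does with $G$.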

    Example Implementation #

    Using `g_test` defined in G-test - multinomial goodness-of-fit > Example Implementation, plus an `is_pmf_subset` helper (not shown here) that checks the support of `p_hat` is contained in the support of `p`.

    use claim::{debug_assert_ge, debug_assert_le};
    use ndarray::ArrayView1;
    use statrs::distribution::{ChiSquared, ContinuousCDF};
    
    /// A goodness-of-fit test between a hypothesized multinomial distribution, `p`,
    /// and an experimentally observed distribution, `p_hat`, both represented as
    /// dense PMFs. `n` is the number of samples taken to construct `p_hat`.
    ///
    /// Returns the p-value `Pr[G >= g | H_0: x ~ p]`, where `g` is the observed G-statistic.
    fn multinomial_test(n: usize, p: ArrayView1<f64>, p_hat: ArrayView1<f64>) -> f64 {
        // degrees of freedom = (number of categories with non-zero probability) - 1
        let nnz = p.fold(0.0, |nnz, &x| nnz + if x > 0.0 { 1.0 } else { 0.0 });
        let dof = nnz - 1.0;
    
        debug_assert_le!(nnz, p.dim() as f64);
        debug_assert_ge!(dof, 1.0);
    
        // `p_hat` puts mass on a category that `p` assigns zero probability,
        // so `p_hat` cannot have been drawn from `p`: reject outright
        if !is_pmf_subset(p_hat, p) {
            return 0.0;
        }
    
        let g = g_test(n, p, p_hat);
        // under H_0, G is asymptotically chi-squared with `dof` degrees of
        // freedom, so the p-value is the upper-tail mass at the observed G
        let chisq = ChiSquared::new(dof).expect("dof must be positive");
        let pvalue = 1.0 - chisq.cdf(g);
    
        println!(
            "multinomial_test: n: {n}, |p|: {}, dof: {dof}, g: {g}, p-value: {pvalue}",
            p.dim()
        );
    
        pvalue
    }
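
    A hypothetical usage sketch (the probabilities and counts below are made up; `array!` is ndarray's literal macro):

    use ndarray::array;

    fn main() {
        // hypothesized die weights vs. an empirical PMF built from n = 1000 rolls
        let p = array![0.5, 0.3, 0.2];
        let p_hat = array![0.48, 0.33, 0.19];

        let pvalue = multinomial_test(1000, p.view(), p_hat.view());

        // reject H_0 ("the rolls follow p") at the 5% significance level
        if pvalue < 0.05 {
            println!("observed distribution deviates from p (p-value {pvalue})");
        } else {
            println!("no significant deviation detected (p-value {pvalue})");
        }
    }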